1 Background

In this problem set we will use a small sample of data from the General Social Survey. The survey is designed to monitor changes in both social characteristics and attitudes. You will work with a sample from one neighborhood. The full neighborhood of ALL individuals is the population. For this problem set we do not know the true population parameters for any of the variables, because we do not have data on every person in the neighborhood.

Setup

First load the necessary packages:

# Recall that loading the tidyverse "umbrella" package loads ggplot2, dplyr, and
# readr all at once. Feel free to load these packages any way you choose.
library(tidyverse)
library(moderndive)

Next, load the data set from where it is stored on the web:

if(!dir.exists("./Data")){dir.create("./Data")}
url <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vSypSoDCMH2N76Vo2dZRPkw2q3t1mbvAXlOtgPDIsHg4NclAQFmER-BdvXH9_lrT40UQCVdPXOi_NMJ/pub?gid=257689625&single=true&output=csv"
if(!file.exists("./Data/gss_sample.csv")){ download.file(url, destfile = "./Data/gss_sample.csv")}
gss_sample <- read_csv("./Data/gss_sample.csv")
DT::datatable(gss_sample, rownames = FALSE)

Be sure to take a look at the data in Figure ??. Each row in the data set is a person that was surveyed (100 rows or cases in total). The variables in the data set include each respondent’s age, race, and number of hours of TV watched a day tvhours.

Setting a seed: We will take some random samples and build sampling distributions in this lab. In order to make sure R takes the same random sample every time you run your code, you can do what is called “setting a seed”. Do this in any code chunk that you take a random sample!

You can set a seed like so. Any number will do. (You do not need to run this right now…just showing you how)

set.seed(45)

2 Confidence Intervals from a Bootstrap Resample

Step 1: Take 1000 Bootstrap Resamples

The following code tells R to take 1000 bootstrap resamples from the gss_sample data. You can set the seed to whatever value you like!

set.seed(42)
boot_samp_1000 <- gss_sample %>% 
  rep_sample_n(size = nrow(gss_sample), reps = 1000, replace = TRUE)

Note a few important details about the rep_sample_n function, and bootstrap sampling in general:

  • size = nrow(gss_sample) tells R that each bootstrap resample we take has 100 cases… the size of the original sample.
  • reps = 1000 tells R to take 1000 bootstrap resamples (each of size 100).
  • The replace = TRUE argument tells R that in each bootstrap resample, we can include a row from gss_sample multiple times. So if for instance, respondent # 12 is the first random resample taken here, respondent 12 is still available to be resampled again at random. Thus, some people may appear multiple times in our bootstrap resample, and some people from the original data set may not appear at all.
  • We save the results in a data frame boot_samp_1000.

Take a look at the boot_samp_1000 data frame we just generated in Figure 2.1. Note that the replicate column labels each bootstrap resample (the first 100 rows are labeled 1, the next 100 rows are labeled 2, etc.)

DT::datatable((boot_samp_1000))

Figure 2.1: Bootstrap Distribution


  1. How many rows does boot_samp_1000 have? Why?

Type your complete sentence answer here using inline R code and delete this comment.


Step 2: Calculate the Bootstrap Statistic

Let’s say we want to use the bootstrap resample that we just generated to calculate a confidence interval for the population mean \(\mu_{tv}\) of tvhours. To do so, we need to know the sample mean \(\bar{x}\) of tvhours for each of the 1,000 bootstrap resamples. In this case, the sample mean \(\bar{x}\) of tvhours for each bootstrap resample is our BOOTSTRAP STATISTIC. We can calculate that with three lines of code as follows:

boot_distrib_tv <- boot_samp_1000 %>% 
  group_by(replicate) %>% 
  summarize(stat = mean(tvhours))
# Viewing the data
head(boot_distrib_tv)
replicate stat
1 2.72
2 3.45
3 2.96
4 2.80
5 3.16
6 2.77

Note that:

  • The group_by() argument tells R to take the sample mean of tvhours separately for each different replicate in the bootstrap resample.
  • We put the sample mean for each bootstrap resample in a column called stat

This is the bootstrap distribution for the mean of tvhours!

# or using infer
library(infer)
boot_dist_tv_infer <- gss_sample %>% 
  specify(response = tvhours) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "mean")
# using a for loop
B <- 1000
bs_mean <- numeric(B)
for(i in 1:B){
  bss <- sample(gss_sample$tvhours, size = sum(!is.na(gss_sample$tvhours)), replace = TRUE)
  bs_mean[i] <- mean(bss)
}

Take a look at the boot_distrib_tv we just created in RStudio’s data viewer.

  1. How many values of the bootstrap statistic stat are there in the object boot_distrib_tv? Please explain why there are this many values of the bootstrap statistic.

Type your complete sentence answer here using inline R code and delete this comment.


Visualizing the Bootstrap Distribution

The bootstrap distribution is shown in Figure 2.2 . This is a histogram of the stat values from boot_distrib_tv.

ggplot(data = boot_distrib_tv, aes(x = stat)) + 
  geom_histogram(color = "white", binwidth = 0.25) + 
  labs(title = "Bootstrap distribution", 
       x = "boostrap statistic (mean tvhours)")
Bootstrap distribution of mean TV hours

Figure 2.2: Bootstrap distribution of mean TV hours

# Or using the infer pipeline
visualize(boot_dist_tv_infer, bins = 9)
Bootstrap distribution computed with `visualize()`

Figure 2.3: Bootstrap distribution computed with visualize()

hist(bs_mean, breaks = "Scott", main = "Bootstrap Distribution",
     xlab = expression(paste(bar(x),"*")), col = "lightblue")
Bootstrap distribution of mean TV hours (base R histogram)

Figure 2.4: Bootstrap distribution of mean TV hours (base R histogram)

Step 3: CI from a Bootstrap Resample

CI Using the 95% Rule

We can now use the bootstrap distribution for the sample mean tvhours \(\bar{x}\) to calculate a 95% confidence interval for the population mean tvhours \(\mu_{tv}\), using the “95% rule for bell shaped distributions”, which states that the middle 95% of values of a bell/normal shaped distribution are between

\[\text{mean} \pm 1.96 \cdot SD\]

  • the mean here is the mean of the original sample
  • the SD here is the standard deviation of the bootstrap distribution, which recall has a special name: the standard error.

We can thus apply the 95% rule, like so:

# Note that z_{0.975} = 1.96 
qnorm(0.975)
[1] 1.959964
(xbar <- mean(gss_sample$tvhours))
[1] 3.14
boot_dist_tv_infer %>% 
  summarize(se = sd(stat), 
            lower_ci = xbar - (qnorm(0.975) * se), 
            upper_ci = xbar + (qnorm(0.975) * se)) -> bnci_tv
bnci_tv
se lower_ci upper_ci
0.3460207 2.461812 3.818188
#
standard_error_ci <- boot_dist_tv_infer %>% 
  get_confidence_interval(type = "se", point_estimate = xbar, level = 0.95)
standard_error_ci
lower_ci upper_ci
2.461812 3.818188

CI Using the Percentile Method

You can also calculate a 95% confidence interval using the percentile method. The logic goes like this:

Since our bootstrap resample had 1000 values of stat:

  • 950 of the stat values fall inside this 95% confidence interval, i.e. 95%
  • 25 values fall below it. i.e. the lower 2.5%
  • 25 values fall above it. i.e. the higher 2.5%

totaling 100%. We can use the quantiles of the bootstrap distribution to find these values as follows:

bpci_tv <- boot_distrib_tv %>% 
  summarize(lower_ci = quantile(stat, 0.025), 
            upper_ci = quantile(stat, 0.975))

bpci_tv
lower_ci upper_ci
2.51 3.89
# Or using infer
boot_distrib_tv %>% 
  get_confidence_interval(type = "percentile", level = 0.95)
lower_ci upper_ci
2.51 3.89
# Which is really just doing the following:
PCI <- boot_distrib_tv %>% 
            summarize(lower_ci = quantile(stat, 0.025),
            upper_ci = quantile(stat, 0.975))
PCI
lower_ci upper_ci
2.51 3.89

This method

  • Asks R to identify the 0.025 quantile of the bootstrap sample means… this is the value below which 2.5% of the values of stat fall (or 25 cases in this example… 25/1000 = 0.025)
  • Asks R to identify the 0.975 quantile for the bootstrap sample means… this is the value above which the other 2.5% of the values of stat fall (or 25 cases in this example 975/1000 = 0.975)
  • The middle 95% of the values fall between these two quantiles

Based on these results, we are 95% confident that the true mean hours of TV watched \(\mu_{tv}\) in the population is between the lower (2.51 hours) and upper (3.89 hours) CI endpoints we just calculated.

Visualizing the Confidence Interval

The bootstrap distribution and the 95% bootstrap percentile confidence intervals we just calculated are shown in the figure below. This is a histogram of the stat values from boot_distrib_tv. The green line is the lower bound of the 95% CI, and the blue line is the upper bound. 950 of the 1000 bootstrap resamples had a mean for tvhours that fell between the green and blue dashed lines…25 of the samples had a mean above the blue dashed line, and 25 of the samples had a mean below the green dashed line.

ggplot(data = boot_distrib_tv, aes(x = stat)) + 
  geom_histogram(color = "black", fill = "pink", binwidth = 0.15) + 
  labs(title = "Bootstrap distribution with 95% CI", 
       x = "boostrap statistic (mean tvhours)") +
  geom_vline(data = bpci_tv, aes(xintercept = lower_ci), color = "green", lwd = 1, lty = "dashed") + 
  geom_vline(data = bpci_tv, aes(xintercept = upper_ci), color = "blue", lwd = 1, lty = "dashed") +
  theme_bw()

boot_dist_tv_infer %>% visualize() +
  shade_confidence_interval(endpoints = PCI, color = "red", fill = "pink")


  1. If we calculated a 90% bootstrap percentile confidence interval for the mean of tvhours using this same bootstrap resample and the percentile method, roughly how many of the 1000 values of tv_mean would fall between the lower_ci and the upper_ci?

Type your complete sentence answer here using inline R code and delete this comment.

# Code to verify here

  1. Use the bootstrap resampling distribution for tvhours generated above (boot_distrib_tv) and the bootstrap percentile method to calculate a 99% bootstrap percentile confidence interval for the mean tvhours. Round your answer to two decimal places. Make sure to use inline R code to report your answer and include appropriate units with the confidence interval.
# Type your code and comments inside the code chunk

Type your complete sentence answer here using inline R code and delete this comment.


  1. Which confidence interval is WIDER: the 95% confidence interval or the 99% confidence interval for the population mean tvhours \(\mu_{tv}\)? Why?

Type your complete sentence answer here using inline R code and delete this comment.


  1. Use the bootstrap resample we generated above (boot_samp_1000), to generate a bootstrap distribution for the sample mean respondent age \(\bar{x}\) instead of tvhours. Please be sure to name it something different than the bootstrap distribution for the sample mean of tvhours
# Type your code and comments inside the code chunk

# Using infer

  1. Calculate a 95% confidence interval for the population mean respondent age \(\mu_{age}\) using the 95% rule method.
# Type your code and comments inside the code chunk

Type your complete sentence answer here using inline R code and delete this comment. Round your answers to two decimal places.


  1. Calculate a 95% bootstrap percentile confidence interval for the population mean respondent age \(\mu_{age}\).
# Type your code and comments inside the code chunk

Type your complete sentence answer here using inline R code and delete this comment.


  1. How do the 95% confidence intervals you calculated in 7 and 8 compare? i.e. are the 95% CI values similar or are they pretty different?

Type your complete sentence answer here using inline R code and delete this comment.


  1. Use the bootstrap resampling distribution for the sample mean respondent age and the percentile method to calculate an 80% confidence interval for the population mean respondent age \(\mu_{age}\).
# Type your code and comments inside the code chunk

Type your complete sentence answer here using inline R code and delete this comment.


3 Bootstrap Sampling Distribution & Confidence Intervals with Categorical Variables

The procedure for generating a bootstrap sampling distribution is VERY similar for categorical data. As an example we will generate a bootstrap sampling distribution for the proportion of respondents that identified as a Person of Color.

Step 1: Take 1000 Bootstrap Resamples

We already did this above! We can use the same boot_samp_1000 as before.

Step2: Calculate the Bootstrap Statistic \(\hat{p}\)

boot_distrib_POC <- boot_samp_1000 %>% 
  group_by(replicate) %>% 
  summarize(n = n(), 
            POC_count = sum(race == "POC"), 
            boot_stat = POC_count/n,
            phat_boot = mean(race == "POC"))
head(boot_distrib_POC)
replicate n POC_count boot_stat phat_boot
1 100 26 0.26 0.26
2 100 24 0.24 0.24
3 100 25 0.25 0.25
4 100 16 0.16 0.16
5 100 28 0.28 0.28
6 100 22 0.22 0.22

Note that with a categorical variable, the code differs in two important respects now:

  • the population parameter that we don’t know, but are inferring about via sampling, is now the population proportion \(p\) that identify as a POC.
  • the sample statistic AKA point estimate that we calculate with the summarize command is now the sample proportion \(\widehat{p}\) rather than a sample mean \(\bar{x}\)
  • To get our proportion \(\widehat{p}\) of ONE of the race categories (POC), we need to first calculate the total sample size for each replicate and the count of how many cases are race == "POC" in each replicate.

Step 3: Generate the 95% Confidence Interval

CI Using the 95% Rule

The following will calculate the 95% confidence interval for the proportion of people that identified as POC using the 95% rule.

phat <- mean(gss_sample$race=="POC")
boot_distrib_POC %>% 
  summarize(se = sd(boot_stat), 
            lower_ci = phat - (qnorm(0.975) * se), 
            upper_ci = phat + (qnorm(0.975) * se))
se lower_ci upper_ci
0.0421354 0.1574161 0.3225839

CI with the Percentile Method

The following will calculate the 95% confidence interval for the proportion of people that identified as “POC” using the percentile method.

boot_distrib_POC %>% 
  summarize(lower_ci = quantile(boot_stat, 0.025), 
            upper_ci = quantile(boot_stat, 0.975))
lower_ci upper_ci
0.16 0.33

  1. Calculate a 95% CI for the population proportion of respondents \(p\) who identified as White using BOTH the percentile and the 95% rule method. Note that you will first need to generate the bootstrap distribution for the proportion of respondents who identified as White.
# Type your code and comments inside the code chunk
# Type your code and comments inside the code chunk
# 95% rule
# Type your code and comments inside the code chunk

Type your complete sentence answer here using inline R code and delete this comment.


4 Confidence Intervals Based on the Theoretical Normal Distribution

As described in moderndive chapter 8.7.2, not only can we generate confidence intervals using a computer/resampling as we’ve been doing until now, in many cases there also exists a mathematical formula! This however necessitates a little mathematical/probability theory; a topic we leave to a more advanced statistics class.

To generate a \((1 - \alpha)\cdot 100\)% confidence interval based on the theoretical normal distribution, we can use the following formula:

\[\widehat{\text{point estimate}} \pm z_{1 - \frac{\alpha}{2}} \cdot \widehat{SE}\]

So, for instance if we wanted to calculate the 95% confidence interval for the population mean of tvhours \(\mu_{tv}\) that respondents watched based on our sample:

  • the point estimate AKA sample statistic in this case would be the sample mean number of tvhours from the sample: \(\bar{x}\)
  • We would estimate the standard error using the formula: where \(s\) is the sample standard deviation, which is a point estimate of the population standard deviation \(\sigma\):

\[\widehat{SE} \approx \frac{s}{\sqrt{n}}\]

and thus a 95% CI \(\rightarrow \alpha = 0.05 \rightarrow z_{1 - \frac{0.05}{2}} \rightarrow z_{0.975} = 1.959964\) would be

\[ \bar{x} \pm 1.96 \cdot \widehat{SE} = \bar{x} \pm 1.96 \cdot \frac{s}{\sqrt{n}} \]

We can perform the calculations in R as follows:

x_bar = mean(gss_sample$tvhours)
gss_sample %>% 
  summarize(sd = sd(tvhours), 
            n = n(), 
            se = sd/sqrt(n), 
            lower_ci = x_bar - qnorm(.975) * se, 
            upper_ci = x_bar + qnorm(.975) * se) -> tci_tv
tci_tv
sd n se lower_ci upper_ci
3.592979 100 0.3592979 2.435789 3.844211

  1. Write down the three 95% confidence intervals for the population mean of tvhours \(\mu_{tv}\) you’ve computed in this problem set. Do this by replacing X, Y, A, B, P, and Q with the appropriate values you’ve computed.

When you are done, make sure all the | in the table still line up so your results print out in a table!

CI construction method lower value upper value
Using boostrap: 95% rule X Y
Using boostrap: percentile rule A B
Using mathematical formula P Q

  1. In your opinion: would you say these three confidence intervals are similar?

Type your complete sentence answer here using inline R code and delete this comment.